Research on performance, robustness, and evolution of the 
global Internet is fundamentally handicapped without accu- 
rate and thorough knowledge of the nature and structure 
of the contractual relationships between Autonomous Sys- 
tems (ASs). In this work we introduce novel heuristics for 
inferring AS relationships. Our heuristics improve upon pre- 
vious works in several technical aspects, which we outline in 
detail and demonstrate with several examples. Seeking to 
increase the value and reliability of our inference results, we 
then focus on validation of inferred AS relationships. We 
perform a survey with ASs' network administrators to col- 
lect information on the actual connectivity and policies of 
the surveyed ASs. Based on the survey results, we find 
that our new AS relationship inference techniques achieve 
high levels of accuracy: we correctly infer 96.5% customer 
to provider (c2p), 82.8% peer to peer (p2p), and 90.3% sib- 
ling to sibling (s2s) relationships. We then cross-compare 
the reported AS connectivity with the AS connectivity data 
contained in BGP tables. We find that BGP tables miss 
up to 86.2% of the true adjacencies of the surveyed ASs. 
The majority of the missing links are of the p2p type, which 
highlights the limitations of present measuring techniques 
to capture links of this type. Finally, to make our results 
easily accessible and practically useful for the community, 
we open an AS relationship repository where we archive, on 
a weekly basis, and make publicly available the complete 
Internet AS-level topology annotated with AS relationship 
information for every pair of AS neighbors. 

Categories and Subject Descriptors 

C.2.5 [Local and Wide-Area Networks]: Internet; C.2.1 
[Network Architecture and Design]: Network topology 

General Terms 

Measurement, Verification 
1. INTRODUCTION 

The global Internet routing system is composed of thou- 
sands of Autonomous Systems (ASs) that operate individual 
parts of the Internet infrastructure. ASs engage in a variety 
of relationships to collectively and ubiquitously route traffic 
in the Internet. These relationships are usually realized in 
the form of business agreements that, in turn, translate into 
engineering constraints on traffic flows within and across in- 
dividual networks. 

Understanding the underlying business AS relationships 
plays a critical role in many research and operational tasks 
ranging from realistic simulations of packets routed in the 
Internet to selection of peers or upstream providers based on 
connectivity and AS relationships of candidate ISPs. Fur- 
ther, statistical data on these relationships are useful for de- 
velopment of more advanced interdomain routing protocols 
and architectures that take into account the presence of AS 
relationships to improve their performance [27]. Moreover, 
business behavior patterns of Internet players influence di- 
rections of ISPs' collaboration and ultimately the evolution 
of the macroscopic infrastructure of the Internet. 

In this study we follow previous works [15] [26] [4] [14] in 
considering the following three major categories of AS re- 
lationships: customer-to-provider (c2p), peer-to-peer (p2p), 
and sibling-to-sibling (s2s). In the c2p category, a customer 
AS pays a provider AS for any traffic sent between the two.^1 
In the p2p category, two ASs freely exchange traffic between 
themselves and their customers, but do not exchange traffic 
from or to their providers or other peers. In the s2s cate- 
gory, two ASs administratively belong to the same organi- 
zation and freely exchange traffic between their providers, 
customers, peers, or other siblings. 

Our work makes the following contributions: 

1. We introduce novel heuristics for inferring c2p, p2p, 
and s2s relationships. Our heuristics improve the state- 
of-the-art in several technical aspects, one of them be- 
ing a more realistic problem formulation that accepts 
that AS paths do not always exhibit a hierarchical pattern. 
We demonstrate using several examples our enhancements 
that lead to more accurate inference results. 

^1 We use the acronym c2p to refer to customer-to-provider 
relationships in general, as well as to links A-B, where AS A 
is a customer of AS B. In contrast, we use the acronym p2c to 
refer only to links A-B, where AS A is a provider of AS B. 

2. We conduct a survey with organizations operating ASs, 
from which we retrieve company-verified information 
about the actual types of relationships they have with 
other networks. We use this information to validate 
the AS relationships we infer and find that they are 
highly accurate. To our knowledge, this study is the 
most exhaustive AS relationship validation effort to 
date. 

3. Using company-verified data we confirm previous mea- 
surement results ^ [23] on the poor coverage of AS 
topologies. In addition, we verify the commonly held 
assumption that most of the missing links are of p2p 
type. 

4. To promote further analysis and discussion of the macro- 
scopic Internet topology, we introduce a publicly avail- 
able AS relationships repository [7]. We automate our 
heuristics and archive datasets of annotated AS links 
on a weekly basis. We also compute and publish a ranking 
of ASs based on inferred AS relationship hierarchies [8]. 

This paper follows our earlier work [13] on inferring c2p 
relationships. It addresses the issue left open of how to select 
the most realistic from the candidate solutions to our c2p 
problem formulation. It then extends our previous work by: 
1) introducing new heuristics for the inference of p2p and s2s 
relationships, 2) validating our inferences, and 3) developing 
an open AS relationships repository. 

We organize the paper as follows. In the next section we 
introduce and describe in detail our heuristics. We compare 
our approach to inferring AS relationships with previous 
ones and discuss our improvements. In section 3 we 
apply the developed heuristics to Internet data and fully 
annotate a snapshot of the AS topology with the computed 
types of relationships. We also briefly discuss our ranking 
of ASs based on inferred AS relationship hierarchies. In 
section 4 we describe the results of our AS survey, validate 
our heuristics, and analyze the true AS relationships that 
we learned from the participating ASs. Finally, we conclude 
in section 5. 

2. INFERENCE HEURISTICS 
2.1 Preliminaries 

Gao's seminal work [15] was the first to formulate and 
systematically study the AS relationships inference prob- 
lem. Gao assumed that every BGP path must comply with 
the following hierarchical pattern: an uphill segment of zero 
or more c2p or s2s links, followed by zero or one p2p links, 
followed by a downhill segment of zero or more p2c or s2s 
links. Paths with this hierarchical structure are valley-free 
or valid. Paths that do not follow this hierarchical structure 
are called invalid and may result from BGP misconfigura- 
tions or from BGP policies that are more complex and do 
not distinctly fall into the c2p/p2p/s2s classification. Fol- 
lowing this definition of valid paths, Gao proposed an infer- 
ence heuristic (which we denote as GAO) that identified top 
providers and peering links based on AS degrees and valid 
paths. 
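
The valley-free pattern described above can be checked mechanically. The following is a minimal sketch, assuming that a path is given as a list of link labels ('c2p', 'p2c', 'p2p', 's2s') in traversal order; the function name and this encoding are illustrative assumptions, not part of GAO.

```python
def is_valley_free(rels):
    """Return True if the sequence of link types follows the hierarchical pattern.

    rels lists the link types along a path in traversal order. The encoding
    ('c2p', 'p2c', 'p2p', 's2s') is an assumption made for this sketch.
    """
    phase = "uphill"  # uphill segment, then at most one p2p link, then downhill
    for rel in rels:
        if rel == "s2s":
            continue  # sibling links may appear in either segment
        if rel == "c2p":
            if phase != "uphill":
                return False  # climbing again after the top is a valley
        elif rel == "p2p":
            if phase != "uphill":
                return False  # a second p2p link, or p2p after going downhill
            phase = "downhill"
        elif rel == "p2c":
            phase = "downhill"
        else:
            raise ValueError("unknown link type: %r" % (rel,))
    return True


print(is_valley_free(["c2p", "c2p", "p2p", "p2c"]))  # valid path
print(is_valley_free(["p2c", "c2p"]))                # valley, invalid path
```
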

Following Gao's work, Subramanian et al. [26] developed 
a mathematical formulation of the inference problem. They 
cast the inference of AS relationships into the Type of 
Relationship (ToR) combinatorial optimization problem: given 
a graph $G(V, E)$ derived from a set of BGP paths $P$, assign 
the edge type (c2p or p2p; s2s relationships are ignored) to 
every edge $i \in E$ such that the total number of valid paths 
in $P$ is maximized. The authors speculated that ToR is 
NP-complete and developed a heuristic solution, which we refer 
to as SARK below. 

Figure 1: An instance of the ToR problem that does not 
admit a solution. Each circle is marked with a tuple X : Y, 
where X is the AS number and Y is the AS degree seen in 
our AS topology. The paths at the bottom yield the ToR 
instance. 

The SARK approach takes as input the BGP tables col- 
lected at different vantage points and computes a rank for 
every AS. This rank is a measure of how close to the graph 
core an AS lies and it is equivalent to node coreness [3] 
[2]. The heuristic then infers AS relationships by comparing 
ranks of adjacent ASs. If the ranks are similar, the algorithm 
classifies the link as p2p, otherwise it is c2p. 

Di Battista et al. [3] and Erlebach et al. [14] independently 
showed that ToR is indeed NP-complete, developed mathe- 
matically rigorous approximate solutions to the problem and 
proved that it is impossible to infer p2p relationships under 
the ToR formulation framework. For this reason, their solu- 
tions (referred to as DPP and EHS) infer c2p relationships 
only and ignore p2p and s2s relationships. 

Despite the ToR formulation being substantially studied, 
we find that it bears limitations that lead to incorrect in- 
ferences. We describe these limitations with the following 
examples. 

Example 1. Ignoring s2s relationships causes proliferation 
of erroneous inferences. Consider a path $p = \{i, j\} \in P$ 
that includes an edge $i$ appearing in multiple paths, 
and an edge j, appearing only in this path p. Suppose that 
in reality i is a sibling edge and j is a c2p edge. It is conve- 
nient to represent a c2p edge annotation by making the edge 
directed from the customer AS to the provider AS. Depend- 
ing on the structure of other paths containing the sibling 
edge i, a ToR solution can direct it as either c2p or p2c. If 
it gets directed as p2c, then to make the path p valid, the 
algorithm has to erroneously direct the edge j as p2c, too. 

Example 2. A solution maximizing the number of 
inferred-to-be-valid paths is not necessarily correct. 
Consider the real-world instance of the ToR problem in 
Figure 1, which was used in [3] to introduce ToR. In this setting 
there are four distinct combinations of edge orientations, 
each maximizing the number of valid paths, but rendering 
one of the paths P1, P2, P4, or P5 as invalid. In path P6, AS 
5056 with degree 11 appears to transit traffic between large 
providers, AS 701 (UUNET) and AS 1239 (Sprint). A ToR 
solution will treat path P6 as valid. Thus, it must infer AS 
5056 as a provider of either UUNET or Sprint, or a provider 
of both, all of which are incorrect. The key point of this ex- 
ample is that while it is reasonable to assume that most AS 
paths in the Internet have a valid hierarchical structure, it 
is still possible that some paths in real networks have a non- 
hierarchical invalid structure. An attempt to annotate such 
paths based on the valley-free model will result in unrealistic 
relationships. 

Example 3. In cases when there are multiple solutions 
with the same number of valid paths, ToR has 
no means to deterministically select the most realistic 
solution. Instead, it has to randomly attribute validity 
to one of the available solutions. Consider a path $p \in P$ that 
is a sequence of edges $i_1, i_2, \ldots, i_{|p|-1}, j \in E$. Suppose that 
the last edge $j$ appears only in this one path $p$ and that it 
is from a large provider (such as UUNET) to a small customer. 
Suppose that the other edges $i_1, i_2, \ldots, i_{|p|-1}$ appear in 
several other paths and are correctly inferred as c2p. In 
this scenario both orientations of the edge $j$ (i.e., the correct 
p2c and the incorrect c2p) render path $p$ valid. Thus, this edge 
cannot receive a deterministic direction from the ToR 
solution. This example explains why Rimondini [24] found 
several well-known large providers such as AT&T, Sprint, and 
Level3 to be inferred as customers of smaller ASs such as 
AS2685 (degree 2), AS8043 (1) and AS13649 (7), respectively. 
We also observed incorrect inferences of this type in 
our experiments. 

The above examples illustrate that: 1) it is necessary to 
account for s2s relationships; 2) trying to simply maximize 
the number of valid AS paths may result in incorrect AS 
relationship inferences; and 3) without additional informa- 
tion, the ToR framework by itself is insufficient to ensure a 
deterministic inference of AS relationships. 

In the following subsections we address these shortcom- 
ings and present heuristics to determine AS relationships 
more accurately. 

2.2 Inferring s2s relationships 

Sibling links connect ASs belonging to the same organiza- 
tion, and thus communication between s2s ASs is not sub- 
ject to the export restrictions found in c2p and p2p rela- 
tionships. For example, rules such as "prefixes learned from 
a peer cannot be announced to other peers" do not apply 
for sibling ASs. Therefore, sibling ASs can implement much 
more flexible and diverse policies than non-affiliated ASs, 
making it very difficult to infer s2s relationships from BGP 
data. For this reason we utilize the IRR databases to anno- 
tate s2s links. We then remove s2s edges from both graph G 
and path set P to avoid proliferation of incorrect c2p infer- 
ences. In effect, we abrogate the limitations of Example 1 
by independently inferring s2s relationships. 

Specifically, we track the organization to which each AS 
is registered in the databases and create groups of sibling 
ASs registered to the same organization. In several cases 
sibling ASs are registered to syntactically different 
organizational names, which still represent the same organization 
by other measures. For example, ASs 7018 and 3339 are 
registered to "AT&T WorldNet Services" and "AT&T Israel", 
respectively. To find such cases, we examine the organiza- 
tion names and manually create a dictionary of organization 
name synonyms. Then, we infer as s2s the ASs that are reg- 
istered to the same organization name or to synonymous 
organization names. 
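
The grouping step described above can be sketched as follows. This is an illustration only: the input mappings, the function name, and AS 64512 with its organization name are hypothetical, and the manually built synonym dictionary is represented here as a plain mapping from a registered name to a canonical name.

```python
from collections import defaultdict
from itertools import combinations


def sibling_pairs(as_to_org, synonyms):
    """Return the set of AS pairs registered to the same (or synonymous) organization.

    as_to_org maps an AS number to its registered organization name.
    synonyms maps a registered name to a canonical name (hypothetical format).
    """
    groups = defaultdict(set)
    for asn, org in as_to_org.items():
        canonical = synonyms.get(org, org)  # fall back to the registered name
        groups[canonical].add(asn)
    pairs = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# The two AT&T entries come from the text; AS 64512 is a made-up example.
as_to_org = {
    7018: "AT&T WorldNet Services",
    3339: "AT&T Israel",
    64512: "Example Networks",
}
synonyms = {"AT&T WorldNet Services": "AT&T", "AT&T Israel": "AT&T"}
print(sibling_pairs(as_to_org, synonyms))
```

An edge of the graph would then be annotated as s2s when its two endpoint ASs form one of the returned pairs.
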

The strength of our approach is that it takes advantage 
of explicit information contained in the IRR databases. Al- 
though we realize these databases are not always up-to-date 
or perfectly accurate, the organization names change less 
frequently than BGP policies and other more dynamic at- 
tributes. We can therefore treat the IRR databases as a 
source of publicly available information, which is reasonably 
accurate for the purpose of inference of s2s relationships. 

2.3 Improving the integrity of c2p inferences 

In Example 2 we demonstrated that trying to maximize 
the number of inferred-to-be-valid paths can lead to incor- 
rect inferences since in reality AS paths are not always hi- 
erarchical. To address this limitation we construct a c2p- 
inference heuristic that is based on the idea of relaxing the 
requirement for a maximal number of valid paths and using 
the AS degree information to detect paths that are invalid 
and that we should not try to direct as valid. We formalize 
this idea as follows. 

For every edge $i \in E$ we introduce a weight $f(d_i^-, d_i^+)$ that 
is a function of the degrees $d_i^-$ and $d_i^+$ ($d_i^- < d_i^+$) of the ASs 
adjacent to the edge $i$. The weight $f$ is large when there 
is a significant degree difference ($d_i^- \ll d_i^+$) between these 
neighboring ASs, and small otherwise. In directing the edges 
of the graph, we use $f$ in the following way: when an edge 
$i$ is directed from a small-degree AS to a large-degree AS, 
it earns a bonus $b_i$ equal to $f(d_i^-, d_i^+)$, otherwise $b_i = 0$. 
We then formulate the inference problem as the following 
multiobjective optimization problem: 

$O_1$: Maximize the number of valid paths in $P$; 
$O_2$: Maximize the sum $\sum_{i \in E} b_i$. 

These two objectives can conflict. 
Consider again Example 2 and Figure 1. According 
to the objective $O_1$, at least one of the edges 1239-5056 or 
5056-701 in P6 must be directed against the node degree 
gradient in order to render P6 valid. By introducing the 
second objective $O_2$, we relax the first objective's requirement 
for the maximal number of valid paths. We can thus 
accept an "invalid" orientation for P6 based on the strong 
degree-gradient indication ($O_2$) that neither 1239 nor 701 
is a customer of 5056. 

This formulation combines the strengths of previous works. 
First, it is similar to SARK, DPP and EHS, in that it re- 
spects the valley-free model and tries to maximize the num- 
ber of valid paths in the input path set P. Secondly, it is sim- 
ilar to GAO, in that it uses the implicit knowledge embed- 
ded in AS degree information to assign directions to edges 
along the node degree gradient by giving certain weighted 
preference to edge orientations collinear with this gradient. 

To solve the newly formulated optimization problem, we 
map the c2p or p2c relationship of edge $i$ to a boolean 
variable $x_i$ as follows: assuming an arbitrary initial direction 
of $i$, an assignment of true to $x_i$ means that edge $i$ keeps 
its original direction, while an assignment of false to $x_i$ 
reverses the direction of $i$. We find assignments to the variables $x_i$ 
by reducing the multiobjective optimization problem to the 
well-known MAX2SAT problem. 

MAX2SAT is a boolean algebra problem: given a set of 
clauses with two boolean variables per clause, $l_i \vee l_j$, find an 
assignment of values to variables maximizing the number 
of simultaneously satisfied clauses [16]. If the clauses are 
weighted, the problem is to maximize the sum of weights 
of the simultaneously satisfied clauses. MAX2SAT is 
NP-complete; however, the semidefinite programming (SDP) 
approach [17] delivers an approximate answer that differs from 
the exact answer by not more than a factor of 0.94. 
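
To make the weighted MAX2SAT objective concrete, the sketch below solves a tiny instance by exhaustive search. It is not the SDP approach mentioned above; brute force is exponential in the number of variables and serves only as an illustration. The clause encoding, a pair of literals (variable index, required value) plus a weight, is an assumption of this sketch.

```python
from itertools import product


def max2sat_brute_force(num_vars, clauses):
    """Return (best_weight, best_assignment) for a weighted MAX2SAT instance.

    Each clause is ((var_a, val_a), (var_b, val_b), weight). A literal
    (v, True) stands for x_v and (v, False) for NOT x_v; the clause is
    satisfied when at least one of its two literals holds.
    """
    best_weight, best_assignment = -1.0, None
    for assignment in product([False, True], repeat=num_vars):
        total = sum(
            weight
            for (a, val_a), (b, val_b), weight in clauses
            if assignment[a] == val_a or assignment[b] == val_b
        )
        if total > best_weight:
            best_weight, best_assignment = total, assignment
    return best_weight, best_assignment


# Hypothetical instance with two variables and four weighted clauses.
clauses = [
    ((0, True), (1, True), 1.0),    # x0 OR x1
    ((0, False), (1, False), 1.0),  # NOT x0 OR NOT x1
    ((0, True), (0, True), 0.5),    # x0 OR x0, a one-variable preference
    ((1, True), (1, True), 0.25),   # x1 OR x1
]
print(max2sat_brute_force(2, clauses))  # (2.5, (True, False))
```
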

To reduce the objective $O_1$ (ToR) to MAX2SAT we use 
the approach of DPP and EHS [4] [14]. This gives a set 
of $x_i \vee x_j$ clauses, where $i, j \in E$. 

To reduce the objective $O_2$ to MAX2SAT, we introduce a 
clause $x_i \vee x_i$ for every edge $i \in E$ that has an initial 
direction along the node degree gradient, and a clause 
$\bar{x}_i \vee \bar{x}_i$ for every edge with an initial direction 
against the node degree gradient. We thus ensure that if an 
edge is directed along the node degree gradient, then the 
corresponding clause is satisfied. To make our MAX2SAT 
instance equivalent to $O_2$, we weight every clause by 
$b_i = f(d_i^-, d_i^+)$. 

We then reduce the resulting multiobjective optimization 
problem to MAX2SAT by refining the weights of the clauses. 
We introduce a parameter $\alpha$ and weight the objective $O_1$ 
by $\alpha$ and the objective $O_2$ by $1 - \alpha$: 

$$
w_{ij}(\alpha) =
\begin{cases}
c_1 \alpha & \text{for } O_1 \text{ clauses}, \\
c_2 (1 - \alpha) f(d_i^-, d_i^+) & \text{for } O_2 \text{ clauses}.
\end{cases}
\qquad (1)
$$

The normalization coefficient $c_1$ is determined from the 
condition $\sum_{i \neq j} w_{ij}(\alpha) = \alpha \Rightarrow c_1 = 1/m_1$, where $m_1$ is the 
number of $O_1$ clauses. The normalization coefficient $c_2$ is 
determined from the condition $\sum_i w_{ii}(\alpha) = 1 - \alpha$. Varying 
$\alpha$ in the region between 0 and 1 controls the relative 
preference of the two objectives.^2 We explore the tradeoff 
between the objectives $O_1$ and $O_2$ and adjust $\alpha$ to the region 
or the point that results in the most accurate AS relationship 
inferences (cf. discussion of the optimal value of $\alpha$ in 
section 3.2). 
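
The two normalization conditions can be turned into clause weights as in the following sketch. The text states $c_1 = 1/m_1$ explicitly; the expression used here for $c_2$, one over the sum of the $f$ values of the $O_2$ clauses, is derived from the second condition and is an assumption of this sketch, as are the function and parameter names.

```python
import math


def clause_weights(num_o1_clauses, f_values, alpha):
    """Return (weight of each O1 clause, list of weights of the O2 clauses).

    num_o1_clauses is m_1, f_values holds f for every O2 clause, and alpha
    lies in [0, 1]. Both inputs are assumed non-empty with a positive f sum.
    """
    c1 = 1.0 / num_o1_clauses  # so that the O1 weights sum to alpha
    c2 = 1.0 / sum(f_values)   # so that the O2 weights sum to 1 - alpha
    w_o1 = c1 * alpha
    w_o2 = [c2 * (1.0 - alpha) * f for f in f_values]
    return w_o1, w_o2


# Hypothetical instance: four O1 clauses and two O2 clauses.
w_o1, w_o2 = clause_weights(4, [1.0, 3.0], alpha=0.25)
print(4 * w_o1, sum(w_o2))  # the two sums are alpha and 1 - alpha
assert math.isclose(4 * w_o1 + sum(w_o2), 1.0)
```
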

Function $f$ encodes dependence on AS degrees into our 
inference process. This function should take large values 
when its two degree arguments differ significantly, and small 
values otherwise, because neighboring ASs 
with a significant size difference typically have a 
customer-to-provider relationship and AS size is strongly correlated with 
AS degree [28]. We note that a given absolute difference in 
AS degrees is of different importance for small ASs and for 
large ASs. For example, a degree difference of 50 says more 
about the relative size of two ASs of degrees 1 and 51 than 
of 3000 and 3050. To account for this relative importance, 
we normalize the degree difference in $f$ to the relative node 
degree gradient $(d_i^+ - d_i^-)/(d_i^+ + d_i^-)$. In addition, topology 
graphs derived from BGP data provide only approximations 
of the true AS degrees. They tend to underestimate degrees 
of small ASs but yield more accurate degree approximations 
for larger ASs [9]. To model this effect, we introduce a 
logarithmic factor reflecting our stronger confidence in the accuracy 
of large AS degrees, compared to small ones. We thus 
construct $f$ as: 

$$
f(d_i^-, d_i^+) = \frac{d_i^+ - d_i^-}{d_i^+ + d_i^-} \log(d_i^+ + d_i^-). \qquad (2)
$$
In summary, our formulation of the c2p relationship in- 
ference problem exploits the structure of the AS paths to 
address the limitations that we illustrated in Examples 2 
and 3 of section 2.1. 
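
As a numerical illustration of the weight $f$ from equation (2), the sketch below evaluates it for the two degree pairs used in the text. The base of the logarithm is not specified in the text; the natural logarithm is assumed here, and the function name is hypothetical.

```python
import math


def degree_weight(d_small, d_large):
    """Relative degree gradient scaled by a logarithmic confidence factor.

    Implements f = (d+ - d-) / (d+ + d-) * log(d+ + d-), assuming the
    natural logarithm and positive degrees.
    """
    if d_small > d_large:
        d_small, d_large = d_large, d_small  # accept arguments in either order
    total = d_small + d_large
    if total <= 0:
        raise ValueError("AS degrees must be positive")
    return (d_large - d_small) / total * math.log(total)


# Same absolute difference of 50, very different weights.
print(degree_weight(1, 51))       # large: the ASs differ greatly in relative size
print(degree_weight(3000, 3050))  # small: the ASs are of comparable size
```
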



2.4 Inferring p2p relationships 

The inference of p2p relationships is more challenging 
than the inference of c2p relationships. As both DPP and 
EHS show, it is impossible to infer p2p relationships within 
the ToR formulation framework. Indeed, a valid path can 
have only one p2p link adjacent to the top provider in the 
path. If we replace this p2p link with a c2p or p2c link, the 
path remains valid, as it still has a valley-free, hierarchical 
structure. Therefore, by maximizing the number of valid paths, 
as is done by ToR, one cannot deterministically infer any 
p2p relationships at all. The difficulty of inferring 
p2p relationships is confirmed by Xia and Gao [30], 
who find that the p2p inference heuristics of GAO and SARK 
yield low accuracy: respectively, 49.08% and 24.63% of 
their p2p inferences are correct. 

To improve the inference of p2p relationships, we develop 
a heuristic that combines GAO and DPP strengths. We 
start from a set of BGP paths P and extract a graph G 
from it. Then we preprocess P to identify links that are not 
of p2p type (non-p2p). 

According to the valley-free model, a path can have at 
most one p2p link and this link must be adjacent to the 
top provider of the path. We thus parse all paths in P 
and denote all links that are not adjacent to the highest 
degree AS in a path as non-p2p. This approach is similar 
but not identical to GAO. GAO assumed that 1) a p2p link 
can lie only between the highest degree AS in a path and its 
highest degree neighbor and 2) that the degree ratio between 
the two edge ASs of a p2p link is smaller than an external 
parameter (discussed below). This method is aggressive in 
excluding non-p2p links. To illustrate, consider an AS path 
A-B-C-D, where AS degrees are $d_A = 10$, $d_B = 500$, $d_C = 
1000$, and $d_D = 501$. GAO allows only link C-D to be of 
p2p type and denotes the others as non-p2p. However, the 
degree difference between B and D is too small to make this 
judgment reliably. Our heuristic addresses this shortcoming 
by including both B-C and C-D as candidate p2p links. We 
denote by R the set of possible p2p edges constructed this 
way. 
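
The construction of the candidate set $R$ can be sketched as follows, using the A-B-C-D path from the text. The data layout (paths as lists of AS identifiers, degrees as a dictionary) and the tie-breaking rule for equal degrees are assumptions of this sketch.

```python
def candidate_p2p_links(paths, degree):
    """Return the links that are adjacent to the highest degree AS in every path.

    A link is marked non-p2p as soon as one path contains it away from that
    path's highest degree AS. Ties in degree are broken by taking the first
    such AS in the path, which is an arbitrary choice made for this sketch.
    """
    seen, non_p2p = set(), set()
    for path in paths:
        top = max(path, key=lambda asn: degree[asn])
        for a, b in zip(path, path[1:]):
            link = frozenset((a, b))
            seen.add(link)
            if top not in link:
                non_p2p.add(link)
    return seen - non_p2p


degree = {"A": 10, "B": 500, "C": 1000, "D": 501}
R = candidate_p2p_links([["A", "B", "C", "D"]], degree)
print(R)  # both B-C and C-D remain candidates, A-B does not
```
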

We then introduce a weight $g(d_i^-, d_i^+)$ for every edge $i \in R$. 
Weight $g$ is large when the ASs adjacent to the edge $i$ 
have similar degrees, and small otherwise. Such weighting 
expresses our higher confidence that a pair of neighboring 
ASs are peers when their degrees are similar. Our selected 
weight $g$ complements the weight $f$ used for the inference of 
c2p links: 

$$
g(d_i^-, d_i^+) = 1 - c_3 f(d_i^-, d_i^+), \qquad (3)
$$

where $c_3 = 1/\max_{i \in E} f(d_i^-, d_i^+)$ is a normalization 
coefficient. 

^2 In the terminology of multiobjective optimization [11], we 
consider the simplest scalar method of weighted sums. 

Next, we remove from $R$ any links that connect ASs with 
large degree differences $d_i^- \ll d_i^+$. More specifically, we 
introduce a threshold $w_e \in [0, 1]$ and remove every edge $i$ 
with $g(d_i^-, d_i^+) < w_e$. The GAO heuristic used an empirically 
selected value of 60 or $\infty$ for a similar threshold. We 
improve upon this approach by using information learned 
from our survey (see section 4) to choose a proper value 
for $w_e$. Namely, for each true p2p and c2p link present both 
in our survey results and in $R$, we examine what selection 
of $w_e$ leads to: 1) erroneously excluding a true p2p link from 
the set of possible p2p links $R$, meaning that $g(d_i^-, d_i^+) < w_e$ 
for a true p2p link; and 2) erroneously not excluding a true 
c2p link from the set of possible p2p links $R$, meaning that 
$g(d_i^-, d_i^+) > w_e$ for a true c2p link. We find that the value 
of $w_e$ that minimizes errors is $g(3, 545)$. The need for an 
external threshold $w_e$ is unfortunate, but the large degree 
difference between $d^- = 3$ and $d^+ = 545$ indicates that this 
threshold simply cleans $R$ of links that are unlikely to be of 
p2p type. 
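
A minimal sketch of the weighting and thresholding step follows. It assumes the natural logarithm in $f$, and it computes the maximum of $f$ over the few links listed in the example rather than over the whole edge set $E$; the degrees, link names, and function names are hypothetical.

```python
import math


def f(d_small, d_large):
    """Degree-gradient weight of equation (2); natural log is an assumption."""
    total = d_small + d_large
    return (d_large - d_small) / total * math.log(total)


def g(d_a, d_b, f_max):
    """p2p weight of equation (3), with c_3 = 1 / f_max."""
    d_small, d_large = sorted((d_a, d_b))
    return 1.0 - f(d_small, d_large) / f_max


def filter_candidates(links, degree, f_max, threshold):
    """Keep the candidate links whose weight g is not below the threshold."""
    return {(a, b) for a, b in links if g(degree[a], degree[b], f_max) >= threshold}


# Hypothetical degrees and candidate links, for illustration only.
degree = {"A": 500, "B": 520, "C": 3, "D": 545, "E": 1, "F": 2000}
links = [("A", "B"), ("C", "D"), ("E", "F")]
f_max = max(f(*sorted((degree[a], degree[b]))) for a, b in links)
threshold = g(3, 545, f_max)  # the threshold value reported in the text
kept = filter_candidates(links, degree, f_max, threshold)
print(sorted(kept))  # E-F has the steepest degree gradient and is removed
```
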

At the last step of our p2p inference process, we examine 
those paths in P that contain more than one edge from R. 
Such paths violate the valley-free model, and we need to 
classify some links from R as non-p2p in order to resolve 
this violation. DPP showed that the problem of finding a 
maximal set of p2p links that do not introduce invalid paths 
in P is equivalent to the Maximum Independent Set (MIS) 
problem. In the MIS formulation, we are given a graph with 
nodes in $N$ and arcs in $A$ and we need to find the maximum 
subset of $N$ such that no two nodes of the subset are joined 
by an arc in $A$. To increase the reliability of the p2p link 
determination, we utilize our assigned link weights g and 
turn the MIS problem into the Maximum Weight Indepen- 
dent Set (MWIS) problem. In the MWIS formulation, we 
give preference to edges with large weights because we know 
that these edges are more likely to be of p2p type. We solve 
the NP-complete MWIS problem by means of a polynomial 
time approximation [6] and find a maximal weight subset 
of R that does not create invalid paths in P. We denote this 
subset as F and admit it as our final set of p2p links. 
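
The MWIS step can be illustrated with a simple greedy heuristic. This is a stand-in chosen for brevity, not the polynomial time approximation of [6]: candidate links are visited in order of decreasing weight and accepted when they do not conflict with an already accepted link. The conflict graph construction (two candidate links conflict when they share a path) follows the description above; names and data are hypothetical.

```python
from itertools import combinations


def conflict_arcs(paths, candidates):
    """Arcs join two candidate p2p links that appear together in some path."""
    arcs = set()
    for path in paths:
        in_path = {frozenset(link) for link in zip(path, path[1:])} & candidates
        for a, b in combinations(in_path, 2):
            arcs.add(frozenset((a, b)))
    return arcs


def greedy_mwis(weights, arcs):
    """Greedy independent set: prefer heavy nodes, skip conflicting ones."""
    chosen = set()
    for node in sorted(weights, key=lambda n: -weights[n]):
        if all(frozenset((node, other)) not in arcs for other in chosen):
            chosen.add(node)
    return chosen


# Hypothetical weights g for three candidate links.
bc, cd, xy = frozenset("BC"), frozenset("CD"), frozenset("XY")
weights = {bc: 0.6, cd: 0.9, xy: 0.3}
paths = [["A", "B", "C", "D"], ["X", "Y"]]
arcs = conflict_arcs(paths, set(weights))
F = greedy_mwis(weights, arcs)
print(F)  # C-D is preferred over B-C; X-Y has no conflict
```
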

2.5 Summary of inference heuristics 

In summary, our inference heuristics take as input a set 
of BGP paths $P$ and a corresponding graph $G(V, E)$ and 
perform the following three consecutive steps: 

1. Use IRRs to infer s2s relationships and create set $S \subset E$ 
of s2s links; 

2. Remove the subset $S$ from consideration and apply our 
heuristic assigning c2p/p2c relationships to the links 
remaining in $E \setminus S$; 

3. Use $P$ and $G$ to infer p2p relationships and to create 
set $F \subset E$ of p2p links. 

The final result is set $S$ of s2s links, set $F$ of p2p links, and 
set $E \setminus F \setminus S$ of c2p links. 

2.6 Related work 

In comparison with other approaches to AS relationship 
inference, our heuristics offer a number of improvements. In 
contrast to DPP and EHS [14], we identify not only c2p, 
but p2p and s2s relationships as well. Moreover, our c2p 
heuristic addresses the limitations of ToR solutions that we 
discussed in section 2.1. 

The work by Subramanian et al. [26] introduced the ToR 
problem and the SARK heuristic [IS] for solving ToR. SARK 
used node coreness [3] [2], which reflects ASs' topological 
positions in AS graphs, as a metric for inferring c2p and p2p 
relationships. In contrast, our heuristics use AS degrees and 
policies encoded in AS paths to infer c2p and p2p 
relationships. 

The work by Gao [15] used AS degrees and the valley-free 
model to infer c2p, p2p, and s2s relationships. The GAO 
algorithm treats every AS path as a hint of the true types of 
links in the path. It takes a set of AS paths as input, 
directs every link toward the highest degree AS in the path, 
and after parsing all paths, counts the directions each link 
has accumulated. If a link has received consistent directions 
throughout the process, it is marked as c2p with the provider 
being at the top of the directed link. Otherwise, the link is 
marked as s2s. Similarly to this work, our heuristics employ 
the valley-free model and AS degrees to infer c2p and p2p 
relationships, but we make a number of technical 
enhancements, which we outline in detail in sections 2.3 and 2.4. We 
use the IRR databases to infer s2s relationships, since it is 
hard to reliably infer them from BGP data. 

Xia and Gao [30] used the IRR databases to extract 
relationships among a subset of ASs and proposed a variation 
of the GAO heuristic that takes this subset as an input to 
infer other AS relationships. They demonstrated how accu- 
rate and current IRR databases provide explicit information 
on AS relationships. On the other hand, dealing with IRR 
data has its own intrinsic methodological problems: 1) it is 
much harder to automate; 2) the data is not always accurate 
and its accuracy level is hard to estimate; 3) not all ASs are 
registered. In our work we also use the IRR data but only 
for s2s relationship inference. For this task, we process the 
organization description records, which are relatively stable 
over time, compared to policy-related records. 

Mao et al. [20] proposed an AS relationship inference 
technique that employs the valley-free model to infer c2p and 
p2p relationships. The technique introduces a set of new 
interesting ideas based on the assumption that ASs prefer 
shorter AS paths over longer AS paths. This assumption 
does not however hold when ASs use routing policies to 
select the next-hop AS on the basis of its policy ranking, 
regardless of AS path lengths. 

Recent work by Muehlbauer et al. [25] introduced a shift 
from inferring AS relationships to inferring AS paths using 
a model with agnostic AS relationships and multiple routers 
per AS. They found that their model leads to more accurate 
results, as far as accuracy of capturing path diversity is con- 
cerned, than a model using inferred AS relationships and 
a single router per AS. By definition, agnostic approaches 
cannot however capture precise characteristics of individual 
ASs. Therefore, agnostic approaches are not appropriate for 
tasks such as constructing realistic economy-based evolution 
models of ASs. In addition, [22] [29] assumed that 
models with c2p/p2p/s2s relationships are equivalent to models 
with a single router per AS. The former models can how- 
ever be extended to use multiple routers per AS, and such 
extensions may result in significantly higher path diversity 
than [22] reported. 

3. APPLYING HEURISTICS TO THE DATA 
3.1 Collecting and sanitizing the data 

We first construct the input BGP path set $P$ and the 
corresponding graph $G(V, E)$. We collected BGP tables from 
RouteViews [21], at 8-hour intervals, over the period from 
03/01/2005 to 03/05/2005, for a total of 15 BGP table 
instances. After cleaning out AS prepending and AS sets, each 
BGP table yields a path set $P_k$. 

Figure 2: Stable paths set $P$ vs. unprocessed path sets $P_k$. 

Table 1: Number of unique degree-valleys and total number 
of degree-valleys found in stable and unstable path sets. 

Invalid paths caused by BGP misconfigurations occur quite 
often and affect 200-1200 BGP table entries each day [19]. 
To mitigate the impact of these misconfigurations, we 
sanitize the input data as follows. We define the persistency of a 
path $p \in \bigcup_{k=1}^{15} P_k$ as the number of sets $P_k$ containing $p$. The 
persistency distribution (Figure 2(a)) shows that although 
the majority of the paths appear in most of the 15 sets, a 
significant number of paths appear only in a few of the sets. 
Since BGP misconfigurations are temporary events, we select 
as input $P$ to our algorithm the paths that appear in all of 
the 15 sets $P_k$. We call $P$ the stable path set and the paths 
that are not selected the unstable path set. 
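
The persistency filter can be sketched as follows, assuming each path set $P_k$ is given as a collection of AS paths stored as tuples. The function name and the example AS numbers (taken from the private-use range) are hypothetical.

```python
from collections import Counter


def split_by_persistency(path_sets):
    """Return (stable, unstable, persistency) for a list of path sets P_k.

    The persistency of a path is the number of sets that contain it. Stable
    paths appear in every set; all other observed paths are unstable.
    """
    persistency = Counter()
    for paths in path_sets:
        persistency.update(set(paths))  # count each path once per set
    stable = {p for p, count in persistency.items() if count == len(path_sets)}
    unstable = set(persistency) - stable
    return stable, unstable, persistency


# Three hypothetical snapshots instead of the 15 used in the text.
snapshots = [
    [(64500, 64501), (64500, 64502)],
    [(64500, 64501)],
    [(64500, 64501), (64500, 64502)],
]
stable, unstable, persistency = split_by_persistency(snapshots)
print(stable, unstable)
```
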



In Figure 2(b) we compare the number of paths, links and 
ASs in the stable paths set with the corresponding averages 
over $P_k$. Even though $P$ is 12.49% smaller than the average 
size of $P_k$, our filtering of unstable paths does not entail 
significant information loss in terms of number of links (4.34% 
reduction) and ASs (1.87% reduction). 

We also verify that the unstable paths include more
non-hierarchical degree sequences, which are often an
indication of a misconfiguration [19], than the stable paths.
We call a degree-valley any AS sequence A-B-C with degrees
d_A, d_B, and d_C such that both d_A and d_C are larger than
d_B plus a small margin constant w: d_A, d_C > d_B + w. (The
small margin w is added to filter out trivial differences
between d_B and d_A, d_C.) Then, for both the stable and
unstable path sets, we randomly select 10,000 paths 100 times
and count the number of degree-valleys for different w.
Table 1 shows the average number of unique degree-valleys and
the average of the total number of degree-valleys in the
selected paths. The number of degree-valleys in the unstable
paths is between 17% and 94.4% larger than in the stable paths.
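The degree-valley count described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the `degree` dictionary, and the toy paths are hypothetical, and a valley is treated as an ordered triple (so A-B-C and C-B-A count as distinct), which the text does not specify.

```python
import random

def degree_valleys(path, degree, w):
    """Yield every consecutive triple (A, B, C) with d_A, d_C > d_B + w."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if degree[a] > degree[b] + w and degree[c] > degree[b] + w:
            yield (a, b, c)

def count_valleys(paths, degree, w):
    """Return (number of unique valleys, total number of valleys)."""
    found = [v for p in paths for v in degree_valleys(p, degree, w)]
    return len(set(found)), len(found)

def average_valley_counts(paths, degree, w, sample_size, trials, seed=0):
    """Average the two counts over repeated random samples of paths."""
    rng = random.Random(seed)
    paths = list(paths)
    unique_sum = total_sum = 0
    for _ in range(trials):
        sample = rng.sample(paths, min(sample_size, len(paths)))
        unique, total = count_valleys(sample, degree, w)
        unique_sum += unique
        total_sum += total
    return unique_sum / trials, total_sum / trials

# Hypothetical toy data: AS degrees and three short paths.
degree = {1: 10, 2: 2, 3: 8, 4: 3}
paths = [(1, 2, 3), (1, 2, 3, 4), (4, 2, 1)]
print(count_valleys(paths, degree, w=1))  # margin filters out 4-2-1
print(count_valleys(paths, degree, w=0))  # 4-2-1 now counts as a valley
```

In the setting of the text, `sample_size` would be 10,000 and `trials` would be 100, applied separately to the stable and unstable path sets.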

3.2 Inferring AS relationships 

s2s relationships. To infer s2s relationships in our graph
we use the RIPE, ARIN, and APNIC databases, collected
on 06/10/2004. We analyze the databases according to the
methodology outlined in section 2.2 and find 1,943
organizations that own multiple AS numbers. We then examine
the input graph G and discover 177 edges between ASs that
belong to the same organization (|S| = 177).

c2p relationships. We remove edges inferred as s2s
from E and apply our methodology detailed in section 2.3 to
the remaining links E \ S. Our implementation uses parts of
the code from EHS [14], the LEDA v4.5 software library [1],
and a publicly available SDP solver DSDP v4.7 [5]. We
compute orientations of the edges in E \ S for different
values of α, sampling densely the interval between 0 and 1.
Recall that when α = 1, our problem formulation is equivalent
to the original ToR formulation, whereas α = 0 corresponds to
entirely degree-based relationship inference.

To evaluate the computed orientations, we introduce a 
metric called reachability. We define the reachability of an AS X
as the number of ASs one can reach from this AS traversing 
only p2c edges. The reachability of an AS has the following 
two properties: 1) it is determined entirely from the inferred 
c2p relationships; and 2) it induces a natural hierarchy of 
ASs based on the size of their customer trees. These two 
properties enable us to perform an initial validation of the 
inferred c2p relationships by matching the top ASs in the 
calculated hierarchy against the empirically known largest 
ISPs in the Internet. 
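As a minimal sketch of the reachability metric defined above, the following breadth-first search follows only provider-to-customer (p2c) edges. The adjacency-dictionary representation, the function name, and the toy graph are assumptions for illustration; the sketch also assumes that the AS itself is not counted among the ASs it reaches.

```python
from collections import deque

def reachability(p2c, start):
    """Count the ASs reachable from `start` over p2c edges only.

    p2c: dict mapping a provider AS to an iterable of its customer ASs.
    The starting AS itself is excluded from the count (an assumption).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for customer in p2c.get(node, ()):
            if customer not in seen:
                seen.add(customer)
                queue.append(customer)
    return len(seen) - 1

# Hypothetical toy hierarchy: A provides transit to B and C, both of which serve D.
p2c = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
reach = {asn: reachability(p2c, asn) for asn in "ABCD"}
print(reach)
```

The visited set makes the traversal safe even if the inferred orientations contain a provider-customer cycle.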

We sort all ASs by their reachability, and group ASs with
the same reachability into levels. ASs at the highest level
have the largest trees of customer ASs. ASs at the lowest
level have the smallest reachability. We then define the
position depth of an AS X as the number of ASs at the
reachability levels above the level of the AS X. We define
the position width of an AS X as the number of ASs at the
same level as the AS X.
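Given per-AS reachability values, position depth and width follow directly from the definitions above. The sketch below is illustrative only: the function name and the toy reachability values are hypothetical, and the width is assumed to include the AS X itself.

```python
from collections import Counter

def position_depth_width(reach):
    """Map each AS to (depth, width) from a dict AS -> reachability.

    depth: number of ASs on strictly higher reachability levels.
    width: number of ASs on the same level, the AS itself included
           (an assumption; the text does not say whether X is counted).
    """
    level_size = Counter(reach.values())
    depth_of_level = {}
    above = 0
    # Walk the levels from highest to lowest reachability
    for value in sorted(level_size, reverse=True):
        depth_of_level[value] = above
        above += level_size[value]
    return {asn: (depth_of_level[r], level_size[r]) for asn, r in reach.items()}

# Hypothetical reachability values for four ASs.
positions = position_depth_width({"A": 3, "B": 1, "C": 1, "D": 0})
print(positions)
```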

In Table 2 we examine the top five ASs in the hierarchies
calculated for the two extreme cases, α = 0 and α = 1.
When α = 0, the well-known ISPs UUNET, AT&T, Sprint,
Level 3, and Qwest occupy the top five positions in the
hierarchy. On the other hand, when α = 1, these positions are
taken by ASs of very small degrees, e.g., AS13987 of degree
3. The columns in the table track the position of these ASs
in the hierarchies induced for α equal to 0, 0.01, 0.05, 0.1,
0.5, and 1. We observe that as α gets closer to 1, the
well-known ASs drift away from the top of the hierarchies, thus
highlighting an increasingly stronger deviation from reality.
This deviation is maximized when α = 1 (the original ToR
formulation), demonstrating the limitations of AS
relationship inference based solely on maximization of the
number of valid paths.
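For readers who want to reproduce the "valid path" count, the following sketch checks a path against the usual valley-free rule: zero or more c2p hops, then at most one p2p hop, then zero or more p2c hops, with s2s hops allowed anywhere. The function names, the `rel_of` lookup table, and the toy topology are assumptions made for the example, not part of the authors' implementation.

```python
def is_valley_free(link_types):
    """link_types: relationship of each hop in traversal order,
    one of 'c2p' (uphill), 'p2c' (downhill), 'p2p', or 's2s'."""
    past_peak = False  # True once a p2p or p2c hop has been traversed
    for rel in link_types:
        if rel == "s2s":
            continue  # sibling hops are allowed anywhere
        if rel in ("c2p", "p2p") and past_peak:
            return False  # climbing or peering again after the peak
        if rel in ("p2p", "p2c"):
            past_peak = True
        elif rel != "c2p":
            raise ValueError(f"unknown relationship type: {rel}")
    return True

def invalid_path_percentage(paths, rel_of):
    """rel_of maps an ordered AS pair (a, b) to the type of hop a -> b."""
    invalid = 0
    for path in paths:
        hops = [rel_of[(a, b)] for a, b in zip(path, path[1:])]
        if not is_valley_free(hops):
            invalid += 1
    return 100.0 * invalid / len(paths)

# Hypothetical topology: C1 buys transit from P and Q, C2 from P; P and Q peer.
rel_of = {
    ("C1", "P"): "c2p", ("P", "C1"): "p2c",
    ("C2", "P"): "c2p", ("P", "C2"): "p2c",
    ("C1", "Q"): "c2p", ("Q", "C1"): "p2c",
    ("P", "Q"): "p2p", ("Q", "P"): "p2p",
}
paths = [("C1", "P", "C2"), ("C2", "P", "Q", "C1"),
         ("C1", "P", "Q"), ("P", "C1", "Q")]
print(invalid_path_percentage(paths, rel_of))  # only P-C1-Q has a valley
```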

^ Since we extract from the RIPE, ARIN, and APNIC databases
only the information that changes slowly with time, the date
of the database dump is not critically important.

Table 2: The reachability-based hierarchy of ASs and percentage of invalid paths as functions of α. For different values of α,
we show the position depth (the number of ASs at the levels above) and width (the number of ASs at the same level) for the ten
ASs that occupy the top five positions when α takes its two extreme values: α = 0 and α = 1. The AS numbers are matched to
AS names using the WHOIS databases.

Table 3: Summary statistics of the inferred relationships.

The induced hierarchies suggest that solutions with values
of α close to 1 are incorrect since they propel small ASs
to the top of the hierarchy, while well-known ISPs sink to
lower positions. On the other hand, the percentage of
invalid paths, listed at the top of Table 2, attains its
maximum of 12.75% when α = 0. The latter observation suggests
that the solution with α = 0 is also incorrect since a large
number of paths violate the valley-free routing model. Taken
together, these two observations indicate that intermediate
values of α yield the best solutions to our multiobjective
optimization, resulting both in realistic hierarchies and in
small numbers of invalid paths. We emphasize that there is no
oracle, intrinsic to the multiobjective optimization problem
formulation, that would reveal the proper balance between the
two objectives and the corresponding "right" value of α. As
is typically the case with multiobjective optimization [11],
we must exercise our external expert knowledge of data
specifics to sift out the most realistic relative weight of
the objectives. For our problem, we formalize this expert
insight as follows: we search for the value of α corresponding
to the smallest percentage of invalid paths among all the
solutions that have only well-known ISPs at the top of the
hierarchy. In our experiments, this most realistic value of α
is 0.01 (cf. Table 2).
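The selection rule for α stated above can be written down as a small filter-then-minimize step. The sketch below is only an illustration: the function name, the dictionary layout, the choice of checking the top five positions, and every number in the toy input are made up for the example and are not taken from Table 2.

```python
def select_alpha(solutions, well_known, top_n=5):
    """Pick the alpha with the fewest invalid paths among admissible solutions.

    solutions: list of dicts with keys 'alpha', 'invalid_pct', and 'top'
               (ASs ordered from the top of the reachability hierarchy).
    well_known: set of ASs accepted as large ISPs (external expert knowledge).
    A solution is admissible if its top_n ASs are all well-known.
    """
    admissible = [s for s in solutions
                  if set(s["top"][:top_n]) <= well_known]
    if not admissible:
        return None
    return min(admissible, key=lambda s: s["invalid_pct"])["alpha"]

# Entirely hypothetical values, chosen only to exercise the rule.
well_known = {"T1", "T2", "T3", "T4", "T5"}
solutions = [
    {"alpha": 0.0, "invalid_pct": 9.0, "top": ["T1", "T2", "T3", "T4", "T5"]},
    {"alpha": 0.2, "invalid_pct": 4.0, "top": ["T2", "T1", "T3", "T5", "T4"]},
    {"alpha": 0.6, "invalid_pct": 2.0, "top": ["T1", "T2", "S1", "T3", "T4"]},
    {"alpha": 1.0, "invalid_pct": 1.0, "top": ["S1", "S2", "S3", "S4", "S5"]},
]
print(select_alpha(solutions, well_known))
```

In this toy input the two solutions with the fewest invalid paths are rejected because small ASs (the hypothetical "S" entries) reach the top of the hierarchy, which mirrors the trade-off discussed in the text.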

p2p relationships. We implement our p2p heuristic detailed
in section 2.4 using the QUALEX solver to approximate
the MWIS problem. We then infer p2p relationships
in the AS topology G and construct the set F of p2p links.
After removing from F the set S of s2s links, we obtain our
final answer that contains 3,553 p2p links (|F \ S| = 3,553).

Table 3 summarizes our results for the whole graph G(V, E).

3.3 Repository of AS Relationships and AS Rank 

To make our results easily accessible and practically useful 
for the community, we automated our inference heuristics. 
We archive the inferred AS relationships on a weekly basis 
and make them available for download at the AS relationship 
data repository [7]. 

We also created an interactive web site [8] where we apply
our automated relationship inferences to rank ASs based on
their customer cones. We define the customer cone of an
AS A as the AS A itself plus all the ASs that it can reach
"for free", that is, following only p2c and s2s links. In other
words, AS A's customer cone is A, plus A's customers, plus
its customers' customers, and so on. We use the following
three metrics to measure the size of customer cones: the
number of ASs in the cone, the number of unique prefixes
advertised by these ASs, and the number of /24 blocks in
the union of these prefixes.
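A minimal sketch of the customer cone and its three size metrics is given below. It is not the code behind the AS ranking site: the function names, the dictionary-based graph, and the toy prefixes are hypothetical, only IPv4 is assumed, and a /24 block is counted whenever any part of it is covered by the union of the prefixes.

```python
import ipaddress
from collections import deque

def customer_cone(asn, p2c, s2s):
    """Return the AS itself plus every AS reachable over p2c and s2s links."""
    cone = {asn}
    queue = deque([asn])
    while queue:
        node = queue.popleft()
        for nxt in (*p2c.get(node, ()), *s2s.get(node, ())):
            if nxt not in cone:
                cone.add(nxt)
                queue.append(nxt)
    return cone

def cone_metrics(cone, prefixes_by_as):
    """Return (number of ASs, number of unique prefixes, number of /24 blocks)."""
    prefixes = {ipaddress.ip_network(p)
                for asn in cone for p in prefixes_by_as.get(asn, ())}
    blocks = 0
    last_block = -1  # index of the highest /24 block counted so far
    # collapse_addresses merges overlapping and adjacent networks
    for net in sorted(ipaddress.collapse_addresses(prefixes)):
        first = max(int(net.network_address) >> 8, last_block + 1)
        last = int(net.broadcast_address) >> 8
        if last >= first:
            blocks += last - first + 1
            last_block = last
    return len(cone), len(prefixes), blocks

# Hypothetical toy data.
p2c = {"A": ["B", "C"], "B": ["D"]}
s2s = {"C": ["E"], "E": ["C"]}
prefixes_by_as = {
    "A": ["10.0.0.0/22"],
    "B": ["10.0.1.0/24"],      # already covered by A's /22
    "D": ["192.0.2.0/24"],
    "E": ["198.51.100.0/25"],  # touches a single /24 block
}
cone_a = customer_cone("A", p2c, s2s)
print(sorted(cone_a), cone_metrics(cone_a, prefixes_by_as))
```

Counting /24 blocks in the union, rather than summing per-prefix sizes, avoids double counting address space announced both by a provider and by its customer.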

AS ranking is valuable not only for conceptual under- 
standing of relative importance of Internet players, but also 
for network vendors and operators in prioritizing their cus- 
tomer lists and in solving other practical tasks. Users of our 
AS ranking have an option to group multiple sibling ASs 
into one organizational entity by specifying sibling groups 
either from the IRRs data, or as user-provided sibling lists. 

4. SURVEY AND VALIDATION 

Measuring, understanding, and modeling AS relationships 
in the Internet are challenging tasks hampered by the fact 
that these relationships are sensitive business information 
and generally considered private by ISPs. Nevertheless, 
without validation against truth, we have no way of eval- 
uating the integrity of our heuristics. 

Most of the previous works relied on implicit validation.
However, indirect approaches are not always reliable. For
example, the authors of [26] [4] [14] [20] used the number of
valid paths as an indicator of the accuracy of the inferred
relationships. As we discussed in section 2, a large number
of valid paths does not necessarily result in a large number
of correctly inferred AS relationships.

In contrast with previous works, we augment our validation
based on implicit metrics (e.g., reachability, section 3.2)
with the explicit data that we collected via private
communications with engineers from the ASs under observation.

We contacted several ASs ranging from large continental
or national ISPs, to content providers, and university
networks. We sent the list of AS relationships that we inferred
for a given AS to this AS's network administrator, peering
negotiator, informed engineer, or researcher. We included
three questions in our email inquiry:

Q1: For the listed inferred AS relationships, specify how
    many are incorrect, and what are the correct types of
    the relationships that we mis-inferred?
Q2: What fraction of the total number of your AS neighbors
    is included in our list?
Q3: Can you describe any AS relationships, more complex
    than c2p, p2p, or s2s, that are used in your networks?

^ We also offered to sign a non-disclosure agreement (NDA)
that protected peering information from being released to
the public and required our data analysis to anonymize
the participating ASs. Only one organization (a government
agency) required an NDA and two commercial ISPs did not
have a policy in place (or the policy was not) to deal with
such requests. They still helpfully provided us with general
answers regarding what percent of peers we inferred
incorrectly.

We performed the survey in the period between 06/07/05
and 06/30/05 and received answers from 38 out of the 78
ASs we contacted. Among these, 5 were tier-1 ISPs, 13
were smaller ISPs, 19 were universities, and 1 was a content
provider. These ASs reported to us the true relationship
types for 3,724 of our inferred AS relationships. The
universities reported only 54 of those 3,724 relationships,
whereas all the remaining relationships came from the ISPs
and the content provider. The BGP-derived AS degrees for the
universities ranged from 1 to 8, while for the remaining ASs
they ranged from 1 to almost 2000.

4.1 Validation of inferred AS relationships

We validate our heuristics by counting the number of
correctly inferred AS relationships. Among the 3,724 verified
AS relationships, 82.6% were c2p, 16.1% were p2p, and 1.2%
were s2s. Table 4 demonstrates that our heuristics correctly
infer 96.5% of c2p, 82.8% of p2p, and 90.3% of s2s
relationships. The total percentage of correctly inferred AS
relationships is 94.2%. This accuracy level demonstrates that
our heuristics produce reliable and accurate inferences of
the true types of AS relationships in the Internet.

Table 4: Validation of the inference results using the survey
data. Each row shows the total number, number of correct,
and percentage of correct inferred AS relationships.

The data in our survey bear certain limitations and our
results should be interpreted accordingly. First, the
self-selection aspect of the sampling of ASs may induce biases
into the resulting statistics. Second, the obtained 3,724
links with confirmed relationships represent 9.7% of the
total number of links in our data. While acknowledging these
limitations, we note that providing rigorous validation of
inferred AS relationships is an extremely challenging task
because of the difficulty in collecting ground-truth data
against which one can check the inferences.

4.2 Missing AS links

In this section we analyze the relationships of the full set
of adjacencies of the participating ASs, including the links
that we do not see in BGP tables and, consequently, in our
graph. The second question in our survey asks ASs for the
ratio of the number of their neighbors in our AS topology
data to the total number of AS neighbors they actually have.
Out of the 38 ASs, 27 (3 of which were tier-1 ISPs) provided
us not only with this ratio, but also with the types of
relationships their ASs have with the missing neighbors. Out
of the total of 1,114 true reported adjacencies, the BGP
tables observe only 552. This finding agrees with the
conclusion of previous works [9] [23] that a significant
number of existing AS connections remain hidden from most BGP
routing tables.

Figure 3: Numbers of true and observed AS links for
different types of AS relationships in the survey.

To improve our understanding of the missing AS links,
we analyze the true relationships of these links. Figure 3
illustrates the per-relationship breakdown of the true and
observed adjacencies of the ASs. It shows that we see only
38.7% of the 865 p2p links, whereas we see 86.7% of
the 218 c2p links, and 93.3% of the 30 s2s links. This gap
demonstrates that BGP-derived AS topologies miss
predominantly p2p links.
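The percentages above can be turned back into approximate link counts as a consistency check. The short calculation below uses only numbers quoted in this section; the per-type "seen" and "missed" counts it prints are derived here by rounding and are not reported in the survey data, and the variable names are chosen for illustration.

```python
# Figures quoted in the text.
true_links = {"p2p": 865, "c2p": 218, "s2s": 30}
seen_pct = {"p2p": 38.7, "c2p": 86.7, "s2s": 93.3}
total_true, total_seen_reported = 1114, 552

# Derived estimates (rounded to whole links).
seen = {t: round(true_links[t] * seen_pct[t] / 100) for t in true_links}
missed = {t: true_links[t] - seen[t] for t in true_links}

total_seen = sum(seen.values())
overall_missed_pct = 100 * (total_true - total_seen_reported) / total_true
p2p_share_of_missed = 100 * missed["p2p"] / sum(missed.values())

print(seen, total_seen)               # per-type estimates and their sum
print(round(overall_missed_pct, 1))   # share of all true adjacencies not seen
print(round(p2p_share_of_missed, 1))  # share of missed typed links that are p2p
```

The estimated per-type counts add up to the 552 observed adjacencies quoted earlier, and p2p links account for the overwhelming majority of the links that are missed.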

The reasons for this bias stem from the intrinsic nature 
of p2p relationships. In a p2p relationship, prefixes learned 
from a peer AS are not advertised to any providers. Conse- 
quently, a link between two p2p ASs is not seen (as a part of 
some AS path) at any upstream ASs. It follows that we can 
only observe a p2p link in the BGP tables of the customers 
or siblings of the p2p ASs. The periphery of the Internet has 
many small interconnecting ASs. Thus, in order to observe
p2p links in the periphery, we should have a significant
number and variety of BGP tables from these small ASs. BGP
tables with a small number of data feeds alone do not provide
representative statistics of p2p links.

Figure 3 also shows that the majority of the 1,114 true
adjacencies are in reality p2p: 865 (77.6%) are p2p, while
only 218 (19.6%) and 30 (2.7%) are c2p and s2s,
respectively. We thus face a large number of p2p relationships,
which appear to be very popular among small and medium
size ASs. Interestingly, some tier-1 ISPs have several dozen
or even hundreds of p2p relationships, frequently with ASs
of smaller size.

Next, we seek to evaluate how representative the
BGP-derived AS degrees are of the true AS degrees. In Figure 4
we plot the number of true AS adjacencies of the surveyed
ASs versus the number of AS adjacencies derived from our
BGP data. At the bottom-left corner of the diagram, 20
ASs that are mainly university networks have their true
numbers of adjacencies close or identical to the measured
numbers of adjacencies. We find that most of the adjacencies
of these small ASs are c2p links. As we have seen above, our
AS topology captures c2p links relatively well. Examining
the rest of the diagram, we first observe that the percentage
of missed adjacencies can be as large as 86.2%. The degrees
for most of the highly connected ASs are under-sampled,
half of them missing more than 70.5% of their links. Further
examination of the missed AS links reveals that most of them
are of p2p type, which is consistent with Figure 3.

^ Note that points (1,1) and (2,2) in the figure correspond
to more than one AS.

[Figure 4 plot: observed AS adjacencies versus true AS adjacencies (x-axis: true AS adjacencies). The legend distinguishes ASs that provide BGP feeds from ASs that do not; annotated points show missed-link percentages of 86.2%, 71.0%, 57.1%, and 50.0%.]

Figure 4: True vs. observed degrees of the surveyed ASs.
We mark 1) the percentage of missed links for a few ASs with
the highest values of this percentage; and 2) the ASs that did
(not) provide feeds in our BGP data.

Our results confirm the common assumption that p2p
relationships, while widespread in the Internet, are not amenable
to observation from only a few BGP feeds and can render
BGP-derived AS degrees significantly smaller than the true AS
degrees. On one hand, the identified deficiencies should in- 
spire further pursuit of representative statistics on the num- 
ber of p2p links, for example via the deployment of dis- 
tributed measurement infrastructures [25]. On the other 
hand, we emphasize that these missing links do not
qualitatively change a set of frequently-used statistical
characteristics of the BGP-derived AS topologies [12] [18] [10].

4.3 Complex AS relationships 

The last question of our survey asks about more complex 
configuration scenarios the AS may be using. From the
responses we learn that although the majority of AS
relationships are simply c2p or p2p, in a few cases their
configurations are either more specialized versions of the
basic c2p or p2p types, or a hybrid of c2p and p2p (c2p/p2p).

For example, the backup provider relationship is a special- 
ized variant of the basic c2p relationship. In this case, a cus- 
tomer AS has a c2p relationship with a provider AS, but this 
relationship only allows traffic to flow during an emergency
situation such as a disruption of connectivity to the main 
upstream provider of the customer AS. Hence, the backup 
provider relationship is a temporally conditioned version of 
the c2p relationship. 

A hybrid c2p/p2p relationship occurs when two ISPs in- 
terconnect at multiple peering points and have different types 
of relationships at these points. For instance, two ISPs can 
have a p2p relationship at a peering point in Europe and a 
c2p relationship at a peering point in the U.S. Another fla- 
vor of a hybrid c2p/p2p relationship is when two ASs have 
different types of relationships for different IP prefixes. In 
this case the ISPs may have a p2p relationship for one set of 
IP prefixes and a c2p relationship for another set of IP
prefixes. These examples of hybrid c2p/p2p relationships
illustrate that AS relationships may also involve spatial
and/or prefix-based aspects.

In other words, based on the configuration descriptions we
collected in our survey, we conclude that AS relationships
vary across the following three dimensions: space, time, and
prefix. Therefore, to fully characterize a relationship
between a pair of ASs, including more complex relationship
scenarios, one has to gain access to information
identifying the ASs' policy configurations per peering location,
per time, and per prefix. Although limited per-prefix and
per-time data are presently available, identifying more
complex relationships for the complete Internet AS topology
is a formidable task as it likely requires significantly more
sources of more detailed data than currently available.

A natural question that arises is how a c2p, p2p, or s2s 
inference for an AS link, which in fact is a more complex re- 
lationship, distorts reality and how prominent this artifact 
is. For a backup relationship, a c2p, p2p, or s2s inference 
misses the temporal component of the relationship. For a 
hybrid c2p/p2p relationship, a c2p or p2p inference misses 
one part of the hybrid relationship. Such artifacts, however, 
do not occur often. Indeed, more complex relationships are 
likely to exist only between large ASs. However, most AS 
links in the Internet connect small ASs to large ASs or con- 
nect small ASs to each other [18]. Such AS pairs are known 
to employ consistent routing policies over their usually single 
or sometimes multiple peering points. 

5. CONCLUSION 

The relationships among ASs in the Internet represent the 
outcome of policy decisions governed by technical and busi- 
ness factors of the global Internet economy. Precise knowl- 
edge of these relationships is therefore an essential building 
block needed for any reliable and effective analysis of techni- 
cal and economic aspects of the global Internet, its structure, 
and its growth. 

In this work we introduced novel heuristics that signifi- 
cantly improve the state-of-the-art in inferring c2p relation- 
ships and carefully address the particularly difficult prob- 
lems of inferring p2p and s2s relationships from currently 
available data. 

In comparison with previous studies that primarily used 
implicit validation of inferred AS relationships, we go a step 
further. In addition to implicit validation, we make an effort 
to collect explicit ground-truth data via direct communica- 
tion with ASs. Using the true relationships of 3,724 links we 
confirmed that our heuristics achieve very high accuracy of 
96.5% (c2p), 82.8% (p2p), and 90.3% (s2s) of correctly in- 
ferred relationships, with the overall accuracy being 94.2%. 
Given the overall difficulty of validating inference results and
that surveys like the one in this paper tend to be extremely
involved procedures in practice, we hope that our work will
serve to lend considerable confidence to such inference
studies.

Using the data of our survey we followed previous studies
[9] [23] in finding that measured AS topologies miss a
significant number of AS links. We take this result further
by verifying the commonly held assumption that most of the
missing links are of p2p type.

Easy access to accurate AS relationship data is essential
to a variety of studies dealing with aspects of Internet
architecture and policy. To support the research community with
data that are as objective as possible, we have automated our
heuristics and calculate and archive AS relationships on a
weekly basis [7]. As an example of using the inferred
relationships we provide a ranking of ASs [8].

From the perspective of empirical research, the global
Internet compares to an economy or an ecosystem. As such,
cross-disciplinary approaches that combine knowledge of the
Internet macroscopic structure with insights into its
economics and policy are required to advance our
understanding of its technical and economic viability. We
believe our work significantly benefits Internet research that
strives to build more encompassing models validated against
reliable and accurate data.